12 July, 2018

Presentation outline

  • Motivation

  • PPforest, Projection pursuit random forest

  • Visually exploring a PPforest object

  • Final comments

Motivation

  • Supervised learning: when the objective is to predict a categorical variable, the task is a classification problem (two-class or multiclass)

  • PPforest is a new supervised method based on bagged projection pursuit trees for classification problems.

  • This method improves predictive performance when the separation between classes lies in combinations of variables.

Motivation

Forests are black-box models; having better tools to open up black-box models provides a better understanding of the data, of the model's strengths and weaknesses, and of how the model will perform on future data.

Ensemble models

  • Ensemble learning methods combine multiple individual models, each trained independently, to build a prediction model.

  • Some well-known examples of ensemble learning methods are boosting (Schapire 1990), bagging (Breiman 1996), and random forests (Breiman 2001), among others.

  • The main differences between ensembles are the type of individual models being combined and the way these individual models are combined.

PPforest

PPforest is an ensemble learning method, built on bagged trees.

Main concepts:

  • Bootstrap aggregation (Breiman 1996; Breiman et al. 1996)

  • Random feature selection (Amit and Geman 1997; Ho 1998) applied to the individual classification trees used for prediction.
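These two ideas can be sketched together. The following Python snippet is purely illustrative (PPforest itself is implemented in R and Rcpp): each tree in the ensemble receives its own bootstrap sample of cases and its own random subset of the variables.

```python
import random

def bootstrap_sample(n, rng):
    """Bootstrap aggregation: draw n case indices with replacement."""
    return [rng.randrange(n) for _ in range(n)]

def random_features(p, m, rng):
    """Random feature selection: choose m of the p variables for one tree."""
    return rng.sample(range(p), m)

# Illustrative sizes: 100 cases, 6 variables, 3 variables per tree, 5 trees.
rng = random.Random(0)
n, p, m, n_trees = 100, 6, 3, 5
forest_specs = [(bootstrap_sample(n, rng), random_features(p, m, rng))
                for _ in range(n_trees)]
```

Each `(cases, features)` pair would then be used to grow one individual tree.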

PPforest, individual classifiers

The individual classifier in PPforest is a PPtree (Lee et al. 2005).

The splits in PPforest are based on a linear combination of randomly chosen variables. By using linear combinations of variables, the individual model (PPtree) separates classes while taking the correlation between variables into account.
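As an illustrative sketch (Python, not the package's R code), a projection-based split projects each observation onto a coefficient vector and thresholds the result; the coefficients and cutoff below are made-up values, whereas PPtree chooses them by optimizing a projection pursuit index.

```python
def project(x_row, coeffs):
    """1-D projection: a linear combination of the chosen variables."""
    return sum(c * v for c, v in zip(coeffs, x_row))

def split(rows, coeffs, cutoff):
    """Send each observation left or right by thresholding its projection."""
    left = [r for r in rows if project(r, coeffs) <= cutoff]
    right = [r for r in rows if project(r, coeffs) > cutoff]
    return left, right

# Hypothetical data and projection direction, for illustration only.
rows = [(1.0, 2.0), (3.0, 0.5), (0.2, 0.1)]
coeffs = (0.8, -0.6)
left, right = split(rows, coeffs, cutoff=0.5)
```

Because the cutoff is applied to a combination of variables rather than to a single variable, correlated class boundaries can be captured in one split.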

PPtree, individual classifier for PPforest

PPtree combines tree-structured methods with projection pursuit dimension reduction. PPtree always treats the data as a two-class problem.

When there are more than two classes, the algorithm uses a two-step projection pursuit optimization at every node split.
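The reduction to two classes can be sketched as follows (Python, purely illustrative; in PPtree the grouping and the projection are both chosen by optimizing a projection pursuit index, which is omitted here).

```python
def two_group_relabel(labels, group_a):
    """Step 1 (sketched): merge the original classes into two groups.
    In PPtree the grouping is found by optimizing a projection pursuit
    index; here the grouping `group_a` is supplied for illustration."""
    return [0 if y in group_a else 1 for y in labels]

# Hypothetical class labels at one node (species names for illustration).
labels = ["Bream", "Pike", "Perch", "Bream", "Smelt"]
binary = two_group_relabel(labels, group_a={"Bream", "Perch"})
# Step 2 would then search for the 1-D projection that best separates
# group 0 from group 1 at this node.
```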

PPtree: Illustration of PPtree algorithm

CART vs PPtree, simulated data

PPforest Illustration

Implementation

  • PPforest is on CRAN

  • The initial version was developed entirely in R but was not fast enough

  • Two code optimization strategies were employed:
    • translating the main functions to Rcpp
    • parallelization

PPforest Diagnostics

  • OOB Error rate

  • Variable importance

  • Proximity matrix

  • Vote matrix
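Two of these diagnostics can be sketched directly from per-tree predictions. The snippet below is an illustrative Python sketch with made-up data (the tree records and class labels are hypothetical, not PPforest output): the vote matrix counts out-of-bag (OOB) votes per case, and the OOB error compares the majority OOB vote with the truth.

```python
from collections import Counter

# Hypothetical per-tree records: each tree knows which cases were
# out-of-bag for it, and its predicted class for every case.
trees = [
    {"oob": {0, 2}, "pred": {0: "A", 1: "B", 2: "B", 3: "A"}},
    {"oob": {1, 3}, "pred": {0: "A", 1: "B", 2: "B", 3: "B"}},
    {"oob": {1, 2}, "pred": {0: "B", 1: "B", 2: "B", 3: "B"}},
]
truth = {0: "A", 1: "A", 2: "B", 3: "B"}

# Vote matrix: per case, counts of OOB votes for each class.
votes = {i: Counter(t["pred"][i] for t in trees if i in t["oob"])
         for i in truth}

# OOB error rate: majority OOB vote versus the true class.
oob_pred = {i: c.most_common(1)[0][0] for i, c in votes.items() if c}
oob_error = sum(oob_pred[i] != truth[i] for i in oob_pred) / len(oob_pred)
```

The proximity matrix would additionally need each tree's terminal-node assignments (the proportion of trees in which two cases land in the same leaf), which is not sketched here.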

Visually diagnosing forest classifiers

Structuring data and constructing plots to explore forest classification models interactively.

We propose a method to explore and diagnose ensemble classifiers based on three levels of analysis:

  1. Individual cases (Observations)
  2. Individual models (Trees)
  3. Performance comparison (PPF vs RF)

A key part of the approach is the use of interactive visualization methods: interactive, web-based visualization of the ensemble methods.

Interactive graphics

Two key capabilities that an interactive graphic should provide:

  • Interactions in each visualization
  • Links between different graphics

Why should we use interactive visualizations?

To see connections within each level that cannot be seen in static graphs.

Interactive graphics

  • Links at the case level allow us to identify cases where the model is not working properly and to characterize these cases based on the original data.

  • We can also identify individual models in the ensemble that are not good enough, and see why this is happening.

  • The last level of analysis focuses on model comparison based on predictive performance by class.

Fishcatch data example

159 fish of 7 species (Bream, Parkki, Perch, Pike, Roach, Smelt, and Whitefish) were caught and measured on 6 variables.

Variable   Description
weight     Weight of the fish (in grams)
length1    Length from the nose to the beginning of the tail (in cm)
length2    Length from the nose to the notch of the tail (in cm)
length3    Length from the nose to the end of the tail (in cm)
height     Maximal height as % of length3
width      Maximal width as % of length3

Panel 1: Individual cases

Example: Fishcatch

Proximity Matrix

Vote matrix, Jittered side-by-side dot plot

Vote matrix, ternary plot

Panel 1: App for individual cases

Panel 2: Individual models

Panel 2: App for individual models

Panel 3: Performance comparison (PPF vs RF)

Panel 3: App for performance comparison (PPF vs RF)

Final comments

  1. Having better tools to open up black-box models provides a better understanding of the data, of the model's strengths and weaknesses, and of how the model will perform on future data.

  2. This visualization app provides a selection of interactive plots to diagnose PPF models.

  3. This shell could be used to make an app for other ensemble classifiers.

  4. Combining shiny, ggplot2, and plotly, we can develop informative interactive visualizations.

Information

Bibliography

Amit, Yali, and Donald Geman. 1997. “Shape Quantization and Recognition with Randomized Trees.” Neural Computation 9 (7). MIT Press:1545–88.

Breiman, Leo. 1996. “Bagging Predictors.” Machine Learning 24 (2). Springer:123–40.

———. 2001. “Random Forests.” Machine Learning 45 (1). Springer:5–32.

Breiman, Leo, and others. 1996. “Heuristics of Instability and Stabilization in Model Selection.” The Annals of Statistics 24 (6). Institute of Mathematical Statistics:2350–83.

Ho, Tin Kam. 1998. “The Random Subspace Method for Constructing Decision Forests.” IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (8). IEEE:832–44.

Lee, Eun-Kyung, Dianne Cook, Sigbert Klinke, and Thomas Lumley. 2005. “Projection Pursuit for Exploratory Supervised Classification.” Journal of Computational and Graphical Statistics 14 (4).

Schapire, Robert E. 1990. “The Strength of Weak Learnability.” Machine Learning 5 (2). Springer:197–227.